Introduction¶
In today's global marketplace, understanding corporate ownership structures is crucial for businesses, investors, and regulators. Companies operate within complex parent-child relationships across domestic and international borders. Identifying whether a company is a "Domestic Ultimate" or a "Global Ultimate" reveals insights into its autonomy, global reach, and decision-making power.
This report explores the development of a Machine Learning Model aimed at accurately predicting whether a company is classified as a "Domestic Ultimate" or "Global Ultimate", based on its operational, financial, and structural characteristics. Ultimately, the model's ability to predict corporate structure seeks to provide actionable insights to assist in competitive analysis, investment strategies, and merger & acquisition decisions.
Setting Up¶
To start off, the necessary libraries to handle data manipulation, visualisation, and machine learning tasks, were imported.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (
accuracy_score, classification_report, confusion_matrix, roc_curve, auc, ConfusionMatrixDisplay
)
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from imblearn.over_sampling import SMOTE
import xgboost as xgb
import lightgbm as lgb
import tensorflow as tf
import statsmodels.api as sm
import random
Data Cleaning¶
Loading and Preparing of Dataset¶
After loading the dataset, we dropped 2 columns as per question requirements.
df = pd.read_csv('Champions_Group_2025.csv')
df = df.drop(['Parent Company', 'Parent Country'], axis=1) # Exclude columns
Excluding Irrelevant Columns by Eye¶
We assessed the relevance of the remaining features. For those that do not seem to contribute meaningfully to our analysis, their removal is justified in the following table.
df = df.drop(['LATITUDE', 'LONGITUDE', 'AccountID', 'Industry', '8-Digit SIC Code', '8-Digit SIC Description', 'Fiscal Year End', 'Company Description'], axis=1)
Checking for Missing Values¶
Next, we examine the percentage of missing values in each column, which is important for the following reasons.
Ensure Data Integrity.
Identifying and addressing missing values helps maintain the accuracy and completeness of your dataset, which is vital for building trustworthy models.Prevent Algorithmic Errors.
Many machine learning algorithms require complete datasets. Unaddressed missing values can cause algorithms to fail or produce incorrect outputs.Avoid Bias.
If missing data is not handled properly, it can introduce bias into your analysis, leading to misleading conclusions.
A heatmap can first be leveraged to visualise the missing data.
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()
Then, we can programmatically identify the missing data.
print(df.isna().sum() / len(df))
Company 0.000000 SIC Code 0.000000 Year Found 0.014872 Ownership Type 0.000000 Square Footage 1.000000 Company Status (Active/Inactive) 0.000000 Employees (Single Site) 0.425022 Employees (Domestic Ultimate Total) 0.002707 Employees (Global Ultimate Total) 0.095059 Sales (Domestic Ultimate Total USD) 0.000000 Sales (Global Ultimate Total USD) 0.000000 Import/Export Status 0.773388 Is Domestic Ultimate 0.000000 Is Global Ultimate 0.000000 dtype: float64
Examining the percentages, we observed three features with relatively high levels of NaN values.
'Square Footage' contains no data at all (all NA values).
'Import/Export Status' has the next highest percentage of NA values (more than 50%). While this indicator would have been highly informative in assessing a company's scale, we cannot assume that the NA values indicate a complete absence of trade, and these missing values may very well stem from incomplete reporting or other data collection issues.
Almost half of the 'Employees (Single Site)' are NA values (42.5%). Although we could fill in missing values Mean or Median Imputation, removing this feature entirely is unlikely to have significant effect on our predictions, as there does not does not appear to be a strong, direct correlation between the number of employees at a particular site and the overall scale of a company. Ultimately, features that capture the overall size and reach of companies, both domestically and globally, such as total domestic or global employees would provide more meaningful insights.
Hence, these three features were removed.
df.isna().sum() / len(df)
df = df.drop(['Import/Export Status', 'Square Footage', 'Employees (Single Site)'], axis=1)
Handling Duplicates¶
We should eliminate duplicate entries in the 'Company' column so as to maintain data integrity and accuracy. Otherwise, duplicate records may distort information, leading to inconsistencies and erroneous predictions, undermining the model's performance and reliability.
Since our test returned False, there was no need to exclude duplicated companies.
Then, we deleted the Company column since company name was deemed irrelevant to its Ultimate status.
df.duplicated(subset=['Company']).any()
df = df.drop('Company', axis=1)
Removing Irrelevant Columns upon further Data Analysis¶
All companies' activity status were assessed, as inactive companies do not yield insights into whether they are classified as global or domestic ultimate entities.
Since our evaluation confirmed that all companies are active, 'Company Status (Active/Inactive)' was entirely removed from the dataset.
df['Company Status (Active/Inactive)'].unique()
df = df.drop('Company Status (Active/Inactive)', axis=1)
One-Hot Encoding on Categorical Variables¶
For non-numeric categorical variables such as 'Ownership Type', one-hot encoding is necessary because most machine learning models require numerical input for effective processing. Even though '2-digit SIC Code' appears to be numeric, it is actually a categorical variable representing industry classifications rather than continuous or ordinal data. This makes one-hot encoding useful for representing the distinct categories without imposing a misleading ordinal relationship between them.
For example, a '2-digit SIC Code' might represent:
- 20: Food and Kindred Products
- 35: Industrial and Commercial Machinery
- 50: Wholesale Trade - Durable Goods
If these codes were treated as continuous numeric values, our model might incorrectly assume that SIC Code 50 is somehow twice of SIC Code 25. One-hot encoding prevents this issue by creating binary columns for each unique SIC code, allowing the model to learn associations for each category independently without introducing spurious correlations based on numeric magnitude.
from sklearn.preprocessing import OneHotEncoder
# one-hot encoding 'Ownership Type'
print(df['Ownership Type'].unique())
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_data = encoder.fit_transform(df[['Ownership Type']])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Ownership Type']))
df = pd.concat([df, encoded_df], axis=1)
df = df.drop('Ownership Type', axis=1)
# one-hot encoding '2-digit SIC Code'
df = df.dropna(subset=['SIC Code'])
df['2-digit SIC Code'] = df[['SIC Code']] // 100
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded_data = encoder.fit_transform(df[['2-digit SIC Code']])
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['2-digit SIC Code']))
df = pd.concat([df, encoded_df], axis=1)
df = df.drop(['SIC Code', '2-digit SIC Code'], axis=1)
['Private' 'Public' 'Non-Corporates' 'Partnership' 'Public Sector' 'Nonprofit']
Removing Rows with Null Values in Key Columns¶
As mentioned earlier, retaining incomplete records could lead to bias or distortion in the analysis, as missing data can obscure or skew relationships between variables. As such, we removed rows with missing values for features with some NA values, specifically 'Year Found', 'Employees (Domestic Ultimate Total)', and 'Employees (Global Ultimate Total)'. By excluding these rows, we can ensure that the analysis reflects the most accurate representation of the companies' histories, scales, and their effects on predictions.
df = df.dropna(subset=['Year Found', 'Employees (Domestic Ultimate Total)', 'Employees (Global Ultimate Total)'])
Ensuring Numeric Values¶
We used the pd.to_numeric() function to convert certain columns to numeric types. The errors='coerce' parameter ensures that any non-numeric values are converted to NaN (Not a Number) without causing errors. To assess the proportion of missing (NaN) values in each column, we applied df.isna(), which returns a DataFrame of the same shape as df, with True for NaN values and False otherwise. By using .sum(), we count the number of NaN values in each column, and dividing by len(df) (the total number of rows) converts these counts into proportions. We used these proportions to evaluate the extent of missing data in each column.
df.loc[:, 'Employees (Domestic Ultimate Total)'] = pd.to_numeric(df['Employees (Domestic Ultimate Total)'], errors='coerce')
df.loc[:, 'Employees (Global Ultimate Total)'] = pd.to_numeric(df['Employees (Global Ultimate Total)'], errors='coerce')
df.loc[:, 'Sales (Global Ultimate Total USD)'] = pd.to_numeric(df['Sales (Global Ultimate Total USD)'], errors='coerce')
df.loc[:, 'Sales (Domestic Ultimate Total USD)'] = pd.to_numeric(df['Sales (Domestic Ultimate Total USD)'], errors='coerce')
df.isna().sum() / len(df)
df.head()
| Year Found | Employees (Domestic Ultimate Total) | Employees (Global Ultimate Total) | Sales (Domestic Ultimate Total USD) | Sales (Global Ultimate Total USD) | Is Domestic Ultimate | Is Global Ultimate | Ownership Type_Non-Corporates | Ownership Type_Nonprofit | Ownership Type_Partnership | ... | 2-digit SIC Code_83 | 2-digit SIC Code_84 | 2-digit SIC Code_86 | 2-digit SIC Code_87 | 2-digit SIC Code_89 | 2-digit SIC Code_91 | 2-digit SIC Code_92 | 2-digit SIC Code_93 | 2-digit SIC Code_96 | 2-digit SIC Code_97 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1980.0 | 300.0 | 300.0 | 76973100 | 76973100 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1993.0 | 100.0 | 100.0 | 9499251 | 9499251 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2018.0 | 22.0 | 22.0 | 13738494 | 13738494 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2004.0 | 100.0 | 100.0 | 103745791 | 103745791 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1986.0 | 33.0 | 33.0 | 60863682 | 60863682 | 1 | 1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 88 columns
In reality, global sales, which includes domestic sales, should either be equal to or more than domestic sales. However, there are a few instances where this was not the case. As such, we carried out the following steps.
- We identified and quantified the instances where global sales are less than domestic sales, calculating the proportion of such cases within the dataset.
- We filtered the dataset to retain only the rows where global sales meet or exceed domestic sales.
This would ensure a more reliable and logical representation of the data.
num_rows = (df['Sales (Global Ultimate Total USD)'] < df['Sales (Domestic Ultimate Total USD)']).sum() # Count rows where A < B
total_rows = len(df) # Total number of rows
proportion = num_rows / total_rows # Compute proportion
print(f"Rows where A < B: {num_rows}/{total_rows} ({proportion:.2%})")
df = df[df['Sales (Global Ultimate Total USD)'] >= df['Sales (Domestic Ultimate Total USD)']]
# Check the new size of the DataFrame
print(f"New number of rows: {len(df)}")
Rows where A < B: 1834/25985 (7.06%) New number of rows: 24151
Visualizing Employee Distribution¶
The distribution of 'Employees (Domestic Ultimate Total)' was visualised through a histogram to better understand the data's characteristics.
Upon examination, we observed that the data is significantly right-skewed, indicating the presence of extreme high values.
bins = np.linspace(0, 5000, 20)
sns.histplot(df['Employees (Domestic Ultimate Total)'], bins=bins, kde=True)
plt.title('Employees (Domestic Ultimate Total) - Histogram')
plt.xlabel('Employees')
plt.ylabel('Frequency')
plt.xlim(0, 5000)
plt.show()
sns.histplot(df['Employees (Global Ultimate Total)'], bins=bins, kde=True)
plt.title('Employees (Global Ultimate Total) - Histogram')
plt.xlabel('Employees')
plt.ylabel('Frequency')
plt.xlim(0, 5000)
plt.show()
To address this skewness, we decided to manage these extremely high values, as they appear unrealistic. A review of several companies revealed that their employee counts do not match the inflated figures reflected in the dataset. By filtering out these outliers, we aim to enhance the accuracy and reliability of our analysis, ensuring that the insights derived are more representative of typical employee distributions... (YULUN + why are we not using Box Plot instead)
def remove_outliers(df, columns):
# Initialize a boolean mask for all rows (True = keep, False = remove)
mask = pd.Series(True, index=df.index)
for column in columns:
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
upper_bound = Q3 + 3 * (Q3 - Q1)
# Update the mask to exclude outliers in the current column
mask &= (df[column] >= 0) & (df[column] <= upper_bound)
# Apply the final mask to filter the DataFrame
filtered_df = df[mask]
# Check the new size of the DataFrame
print(f"New number of rows: {len(filtered_df)}")
return filtered_df
# List of columns to check for outliers
columns_to_check = [
'Employees (Domestic Ultimate Total)',
'Employees (Global Ultimate Total)',
'Sales (Global Ultimate Total USD)',
'Sales (Domestic Ultimate Total USD)'
]
# Remove outliers considering all columns together
df = remove_outliers(df, columns_to_check)
# Log-transform the specified columns (avoid taking log of zero or negative values)
columns_to_log_transform = [
'Employees (Domestic Ultimate Total)',
'Employees (Global Ultimate Total)',
'Sales (Domestic Ultimate Total USD)',
'Sales (Global Ultimate Total USD)'
]
# Apply log transformation
for column in columns_to_log_transform:
df[column] = np.log1p(df[column])
New number of rows: 16966
<ipython-input-218-2c80c9b24e8b>:41: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df[column] = np.log1p(df[column])
Visualising Class Imbalances¶
Visualising class imbalances is important to identify bias in the dataset. If one class significantly outnumbers the other, the model might favor the majority class, leading to poor generalisation for the minority class.
In this case, less than 5% of companies are Global Ultimate (meaning that more than 95% of companies are not Global Ultimate). If no data balancing techniques are employed, our predictive model is likely to achieve approximately 95% accuracy by predicting the majority class, but it will be futile in detecting "Global Ultimate" companies.
Therefore, we will be employing Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples for the minority class 'Is Not Global Ultimate' using k-nearest neighbors (KNN) during the Model Training process later on.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Plot for 'Is Global Ultimate'
sns.barplot(x=df['Is Global Ultimate'].value_counts().index,
y=df['Is Global Ultimate'].value_counts().values,
palette="viridis", ax=axes[0])
axes[0].set_title("Class Distribution: Is Global Ultimate")
axes[0].set_xlabel("Class Label")
axes[0].set_ylabel("Count")
# Plot for 'Is Domestic Ultimate'
sns.barplot(x=df['Is Domestic Ultimate'].value_counts().index,
y=df['Is Domestic Ultimate'].value_counts().values,
palette="magma", ax=axes[1])
axes[1].set_title("Class Distribution: Is Domestic Ultimate")
axes[1].set_xlabel("Class Label")
plt.tight_layout()
plt.show()
<ipython-input-219-c7d9498afd27>:4: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=df['Is Global Ultimate'].value_counts().index, <ipython-input-219-c7d9498afd27>:12: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=df['Is Domestic Ultimate'].value_counts().index,
Uncovering Patterns in the Cleaned Dataset Using Unsupervised Learning¶
We started with Unsupervised Learning Techniques to uncover hidden patterns or groupings in the cleaned data before moving on to Supervised Learning Techniques. This approach was chosen because using a high number of features directly in supervised models could lead to issues such as overfitting, high computation time, and model complexity. Unsupervised learning would then offer an opportunity to reduce dimensionality and identify the most important underlying features before training more complex supervised models.
We considered 3 types of Unsupervised Learning Techniques in this process, all of which requiring X to be scaled.
X = df.drop(columns=['Is Global Ultimate', 'Is Domestic Ultimate']) # Exclude target columns
X = X.iloc[:, 1:] # Select all columns except the first one
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Convert scaled data back to a pandas DataFrame with original feature names
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
K-means Clustering¶
K-means was the first technique considered.
However, we deemed that it was unsuitable for our dataset due to the fact that Euclidean distance is used as the measure of similarity. This measure requires variables to be continuous, as Euclidean distance computes the direct spatial distance between data points. Since most of our variables were categorical, K-means is unlikely able to meaningfully assess similarity between points.
Nevertheless, we attempted to conduct K-means for variables such as 'Employees (Domestic Ultimate Total)', 'Employees (Global Ultimate Total)', 'Sales (Domestic Ultimate Total USD)', 'Sales (Global Ultimate Total USD)'
But no clear elbow point was found, and we did not find a clear way to cluster our data. Moreover, experiments where we included k-means data seemed to worsen model performance, leading us to exclude k-means clusters from our built dataset.
features = df[['Employees (Domestic Ultimate Total)', 'Employees (Global Ultimate Total)',
'Sales (Domestic Ultimate Total USD)', 'Sales (Global Ultimate Total USD)']]
# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Initialize lists to store evaluation metrics
inertia = []
silhouette_scores = []
cluster_range = range(2, 11) # Start from 2 clusters for silhouette score
# Perform KMeans clustering for each number of clusters and evaluate
for k in cluster_range:
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(scaled_features) # Using standardized features
# Inertia (Elbow Method)
inertia.append(kmeans.inertia_)
# Silhouette Score
labels = kmeans.labels_
score = silhouette_score(scaled_features, labels)
silhouette_scores.append(score)
print(f"Clusters: {k}, Inertia: {kmeans.inertia_:.2f}, Silhouette Score: {score:.4f}")
# Plot Elbow Method (Inertia)
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.plot(cluster_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (SSE)')
plt.title('Elbow Method (Inertia)')
plt.grid(True)
# Plot Silhouette Scores
plt.subplot(1, 2, 2)
plt.plot(cluster_range, silhouette_scores, marker='o', linestyle='--', color='orange')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score')
plt.grid(True)
plt.tight_layout()
plt.show()
Clusters: 2, Inertia: 25865.24, Silhouette Score: 0.5984 Clusters: 3, Inertia: 23693.23, Silhouette Score: 0.5611 Clusters: 4, Inertia: 17828.37, Silhouette Score: 0.5229 Clusters: 5, Inertia: 13336.45, Silhouette Score: 0.5588 Clusters: 6, Inertia: 10610.18, Silhouette Score: 0.5589 Clusters: 7, Inertia: 9531.70, Silhouette Score: 0.5469 Clusters: 8, Inertia: 8372.35, Silhouette Score: 0.5569 Clusters: 9, Inertia: 7887.65, Silhouette Score: 0.5594 Clusters: 10, Inertia: 7293.11, Silhouette Score: 0.5636
Hierarchical Clustering¶
Next, we experimented with Hierarchical Clustering, specifically using Complete Linkage as the dissimilarity measure. This approach attempts to ensure more balanced clusters by considering the maximum distance between data points in different clusters.
However, the resulting dendrogram still appeared elongated and imbalanced, resembling the results of Single Linkage (which considers the minimum distance between points in different clusters). This indicated that the clustering did not produce clear, actionable groupings, making it difficult to identify meaningful categories for subsequent analysis.
Moreover, this model is highly-time consuming in terms of its run time, and takes up too much RAM too. Therefore, we deemed this technique as unsuitable.
Principal Component Analysis (PCA)¶
Given the challenges faced with clustering techniques, we proceeded with Principal Component Analysis (PCA), with the goal of reduce the dimensionality of the dataset for better model efficiency and interpretability without losing key information. This helped reduce the number of features while retaining the most meaningful variance in the dataset.
To check for multicollinearity between features in our dataset (as PCA works better when features are highly correlated), a correlation matrix was plotted. The matrix reveals signficant correlation between the employee numbers and the sales figures, perhaps because these columns are all correlated with company size.
There is also some correlation between Ownership Types columns. This motivated us to conduct Principle Component Analysis and include PCs that account for higher percentages of variance in our training dataset.
plt.figure(figsize=(20, 16)) # Increase figure size
sns.heatmap(
corr_matrix,
annot=False, # Set to True if you want to see the correlation values
fmt=".2f",
cmap="coolwarm",
linewidths=0.1,
vmin=-1, vmax=1,
xticklabels=corr_matrix.columns,
yticklabels=corr_matrix.columns
)
# Rotate labels
plt.xticks(rotation=45, ha='right', fontsize=10)
plt.yticks(fontsize=10)
# Add title
plt.title("Improved Correlation Matrix of Features", fontsize=16)
plt.show()
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance = pca.explained_variance_ratio_
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(explained_variance) + 1), explained_variance * 100, marker='o')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio (%)')
plt.xticks(range(1, len(explained_variance) + 1))
plt.grid(True)
plt.show()
Based on the Scree Plot, we identified that the elbow point occurred at the 7th Principal Component, leading us to explore the top contributing features for the first 7 PCs.
Viewing the contributing features, we decided to include the first 3 PCs, which explain about 8.4% of variance, into our model because they were the most interpretable.
PC1, which includes employee count and sales figures heavily, reflects our observations with the correlation matrix. We associate PC1 with the size of a company and its scale of operations, and we deem this as relevant to the hierarchy of a company.
PC2, also reflects our correlation matrix observations. It includes details on ownership type of a company.
PC3 appears to represent the company's degree of control upon other companies, which we believe is highly relevant to our analysis of company hierarchy. SIC 67xx includes holding companies, while SIC 91xx and 97xx are involved in governmental affairs and national security.
pca = PCA(n_components=7) # Select the first 7 principal components
X_pca = pca.fit_transform(X_scaled)
# Get component loadings (feature contributions to PCs)
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=[f'PC{i+1}' for i in range(7)])
# Display the top features contributing to each principal component
print("Top contributing features for each principal component:")
for i in range(7):
print(f"\nPrincipal Component {i+1}:")
print(loadings.iloc[:, i].abs().sort_values(ascending=False).head(5)) # Top 5 most important features
# Plot the heatmap for feature contributions
plt.figure(figsize=(10, 6))
sns.heatmap(loadings, cmap='coolwarm', annot=True, fmt=".2f")
plt.title("Feature Contributions to Principal Components")
plt.show()
df.isna().sum()
Top contributing features for each principal component: Principal Component 1: Employees (Global Ultimate Total) 0.487527 Employees (Domestic Ultimate Total) 0.483395 Sales (Global Ultimate Total USD) 0.476036 Sales (Domestic Ultimate Total USD) 0.459822 2-digit SIC Code_67 0.150521 Name: PC1, dtype: float64 Principal Component 2: Ownership Type_Private 0.653459 Ownership Type_Public 0.565429 Ownership Type_Public Sector 0.266115 Ownership Type_Partnership 0.189404 2-digit SIC Code_67 0.163747 Name: PC2, dtype: float64 Principal Component 3: Ownership Type_Public Sector 0.632434 2-digit SIC Code_91 0.440947 2-digit SIC Code_97 0.423380 Ownership Type_Public 0.291762 2-digit SIC Code_67 0.191354 Name: PC3, dtype: float64 Principal Component 4: 2-digit SIC Code_67 0.669523 2-digit SIC Code_87 0.341124 2-digit SIC Code_73 0.284551 Sales (Domestic Ultimate Total USD) 0.242417 Ownership Type_Partnership 0.212069 Name: PC4, dtype: float64 Principal Component 5: 2-digit SIC Code_86 0.508301 Ownership Type_Nonprofit 0.425983 Ownership Type_Partnership 0.417780 2-digit SIC Code_81 0.412298 Ownership Type_Non-Corporates 0.371910 Name: PC5, dtype: float64 Principal Component 6: 2-digit SIC Code_81 0.507814 Ownership Type_Partnership 0.479628 2-digit SIC Code_86 0.356899 Ownership Type_Nonprofit 0.309814 Ownership Type_Public 0.304614 Name: PC6, dtype: float64 Principal Component 7: 2-digit SIC Code_92 0.567844 Ownership Type_Non-Corporates 0.516358 Ownership Type_Nonprofit 0.451946 2-digit SIC Code_86 0.314844 2-digit SIC Code_73 0.175513 Name: PC7, dtype: float64
| 0 | |
|---|---|
| Year Found | 0 |
| Employees (Domestic Ultimate Total) | 0 |
| Employees (Global Ultimate Total) | 0 |
| Sales (Domestic Ultimate Total USD) | 0 |
| Sales (Global Ultimate Total USD) | 0 |
| ... | ... |
| 2-digit SIC Code_91 | 0 |
| 2-digit SIC Code_92 | 0 |
| 2-digit SIC Code_93 | 0 |
| 2-digit SIC Code_96 | 0 |
| 2-digit SIC Code_97 | 0 |
88 rows × 1 columns
# Adding Principal Components to df
df_pca = pd.DataFrame(X_pca[:, :3], columns=[f'PC{i+1}' for i in range(3)])
# Reset indices
df_pca.reset_index(drop=True, inplace=True)
df.reset_index(drop=True, inplace=True)
# Concatenate the dataframes
df = pd.concat([df, df_pca], axis=1)
# Display the first few rows
df.head()
| Year Found | Employees (Domestic Ultimate Total) | Employees (Global Ultimate Total) | Sales (Domestic Ultimate Total USD) | Sales (Global Ultimate Total USD) | Is Domestic Ultimate | Is Global Ultimate | Ownership Type_Non-Corporates | Ownership Type_Nonprofit | Ownership Type_Partnership | ... | 2-digit SIC Code_87 | 2-digit SIC Code_89 | 2-digit SIC Code_91 | 2-digit SIC Code_92 | 2-digit SIC Code_93 | 2-digit SIC Code_96 | 2-digit SIC Code_97 | PC1 | PC2 | PC3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1993.0 | 4.615121 | 4.615121 | 16.066724 | 16.066724 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.600046 | 0.293654 | -0.465165 |
| 1 | 2018.0 | 3.135494 | 3.135494 | 16.435712 | 16.435712 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.545816 | 0.102720 | -0.530836 |
| 2 | 1986.0 | 3.526361 | 3.526361 | 17.924147 | 17.924147 | 1 | 1 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.834285 | 0.235981 | -0.697102 |
| 3 | 2008.0 | 2.639057 | 2.639057 | 13.138497 | 13.138497 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.947151 | 0.328711 | 0.536712 |
| 4 | 2006.0 | 2.197225 | 2.197225 | 12.359862 | 12.359862 | 0 | 0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.827995 | 0.225277 | 0.620008 |
5 rows × 91 columns
Utilising newly-formed categories in Training and Testing of Supervised Learning Models¶
Model Training¶
Using a 80/20 ratio, observations in the dataset were split into two segments: (1) ... obs. in train set for training purposes and (2) ... obs in test set for testing purposes, as reflective in the following codes for each Machine Learning method.
We will be training and testing two models to separately predict whether the company is Global or Local Ultimate, or both.
Logistic Regression¶
def train_logistic_regression(X, y, target_name, apply_smote=False):
# Check class balance
class_counts = y.value_counts()
print(f"{target_name} Class Distribution:\n{class_counts}\n")
# Apply SMOTE only for "Is Global Ultimate"
if apply_smote:
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
else:
X_resampled, y_resampled = X, y
# Split into train-test sets using resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)
# Initialize and standardize the features (scaling after resampling)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize and train the Logistic Regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1] # Get probability scores for positive class
# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"{target_name} - Model Accuracy: {accuracy * 100:.2f}%")
# Classification Report
print(f"Classification Report for {target_name}:\n{classification_report(y_test, y_pred)}")
# Compute and display confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Not " + target_name, target_name],
yticklabels=["Not " + target_name, target_name])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f"Confusion Matrix for {target_name}")
plt.show()
# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--') # Diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve for {target_name}')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
# Return trained model and metrics
return model, accuracy
# Model 1: Predicting "Is Global Ultimate"
y_global = df['Is Global Ultimate']
log_model_global, acc_global = train_logistic_regression(X_scaled, y_global, "Is Global Ultimate", apply_smote=True)
# Model 2: Predicting "Is Domestic Ultimate"
y_domestic = df['Is Domestic Ultimate']
log_model_domestic, acc_domestic = train_logistic_regression(X_scaled, y_domestic, "Is Domestic Ultimate", apply_smote=False)
Is Global Ultimate Class Distribution:
Is Global Ultimate
0 10298
1 6668
Name: count, dtype: int64
Is Global Ultimate - Model Accuracy: 66.26%
Classification Report for Is Global Ultimate:
precision recall f1-score support
0 0.71 0.55 0.62 2060
1 0.63 0.78 0.70 2060
accuracy 0.66 4120
macro avg 0.67 0.66 0.66 4120
weighted avg 0.67 0.66 0.66 4120
Is Domestic Ultimate Class Distribution:
Is Domestic Ultimate
0 9392
1 7574
Name: count, dtype: int64
Is Domestic Ultimate - Model Accuracy: 63.70%
Classification Report for Is Domestic Ultimate:
precision recall f1-score support
0 0.68 0.65 0.66 1879
1 0.59 0.63 0.61 1515
accuracy 0.64 3394
macro avg 0.63 0.64 0.63 3394
weighted avg 0.64 0.64 0.64 3394
Decision Tree¶
def train_decision_tree(X, y, target_name, max_depth=5, apply_smote=False):
# Check class balance
class_counts = y.value_counts()
print(f"{target_name} Class Distribution:\n{class_counts}\n")
# Apply SMOTE if requested
if apply_smote:
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
else:
X_resampled, y_resampled = X, y
# Split into train-test sets using resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)
# Initialize and train the Decision Tree Classifier with max_depth
clf = DecisionTreeClassifier(random_state=42, max_depth=max_depth)
clf.fit(X_train, y_train)
# Predictions
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1] # Get probability scores for the positive class
# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"{target_name} - Model Accuracy: {accuracy * 100:.2f}%\n")
# Classification Report
print(f"Classification Report for {target_name}:\n{classification_report(y_test, y_pred)}")
# Compute and display confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Not " + target_name, target_name],
yticklabels=["Not " + target_name, target_name])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f"Confusion Matrix for {target_name}")
plt.show()
# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--') # Diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve for {target_name}')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
# Plot Decision Tree with black lines (pruned style)
plt.figure(figsize=(20, 10))
plot_tree(
clf,
filled=True,
feature_names=df.columns,
class_names=['0', '1'],
rounded=True,
fontsize=12,
proportion=True, # Make the decision tree more compact
node_ids=True, # Show node IDs
label='all' # Show labels for all nodes
)
plt.title(f"Pruned Decision Tree for {target_name}")
plt.show()
# Return trained model and metrics
return clf, accuracy
# Model 1: Predicting "Is Global Ultimate"
y_global = df['Is Global Ultimate']
clf_global, acc_global = train_decision_tree(X_scaled, y_global, "Is Global Ultimate", max_depth=5, apply_smote=True)
# Model 2: Predicting "Is Domestic Ultimate"
y_domestic = df['Is Domestic Ultimate']
clf_domestic, acc_domestic = train_decision_tree(X_scaled, y_domestic, "Is Domestic Ultimate", max_depth=5, apply_smote=False)
Is Global Ultimate Class Distribution:
Is Global Ultimate
0 10298
1 6668
Name: count, dtype: int64
Is Global Ultimate - Model Accuracy: 72.11%
Classification Report for Is Global Ultimate:
precision recall f1-score support
0 0.87 0.52 0.65 2060
1 0.66 0.92 0.77 2060
accuracy 0.72 4120
macro avg 0.76 0.72 0.71 4120
weighted avg 0.76 0.72 0.71 4120
Is Domestic Ultimate Class Distribution:
Is Domestic Ultimate
0 9392
1 7574
Name: count, dtype: int64
Is Domestic Ultimate - Model Accuracy: 68.62%
Classification Report for Is Domestic Ultimate:
precision recall f1-score support
0 0.86 0.51 0.64 1879
1 0.60 0.90 0.72 1515
accuracy 0.69 3394
macro avg 0.73 0.71 0.68 3394
weighted avg 0.75 0.69 0.68 3394
def train_random_forest(X, y, target_name, max_depth=5, n_estimators=100, apply_smote=False, feature_names=None):
# Check class balance
class_counts = y.value_counts()
print(f"{target_name} Class Distribution:\n{class_counts}\n")
# Apply SMOTE if requested
if apply_smote:
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y) # X is the original DataFrame
else:
X_resampled, y_resampled = X, y
# Split into train-test sets using resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)
# Initialize and train the Random Forest Classifier
rf = RandomForestClassifier(random_state=42, max_depth=max_depth, n_estimators=n_estimators)
rf.fit(X_train, y_train)
# Predictions
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1] # Get probability scores for the positive class
# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"{target_name} - Model Accuracy: {accuracy * 100:.2f}%\n")
# Classification Report
print(f"Classification Report for {target_name}:\n{classification_report(y_test, y_pred)}")
# Compute and display confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Not " + target_name, target_name],
yticklabels=["Not " + target_name, target_name])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title(f"Confusion Matrix for {target_name}")
plt.show()
# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--') # Diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve for {target_name}')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
# Plot Random Forest feature importances (top 10)
feature_importances = rf.feature_importances_
sorted_idx = feature_importances.argsort()[-10:] # Get the indices of the top 10 features
sorted_importances = feature_importances[sorted_idx]
plt.figure(figsize=(10, 6))
plt.barh(range(len(sorted_idx)), sorted_importances, align='center')
plt.yticks(range(len(sorted_idx)), feature_names[sorted_idx])
plt.xlabel('Feature Importance')
plt.title(f'Top 10 Feature Importances for {target_name}')
plt.show()
# Return trained model and metrics
return rf, accuracy
# Create a copy of the feature names before scaling
feature_names = X.columns
# Target variable: "Is Global Ultimate"
y_global = df['Is Global Ultimate']
# Train the Random Forest model for "Is Global Ultimate" with SMOTE
rf_global, acc_global = train_random_forest(X_scaled, y_global, "Is Global Ultimate", max_depth=5, apply_smote=True, feature_names=feature_names)
# Optionally, if you'd like to train for "Is Domestic Ultimate" as well
y_domestic = df['Is Domestic Ultimate']
rf_domestic, acc_domestic = train_random_forest(X_scaled, y_domestic, "Is Domestic Ultimate", max_depth=5, apply_smote=True, feature_names=feature_names)
Is Global Ultimate Class Distribution:
Is Global Ultimate
0 10298
1 6668
Name: count, dtype: int64
Is Global Ultimate - Model Accuracy: 71.17%
Classification Report for Is Global Ultimate:
precision recall f1-score support
0 0.81 0.55 0.66 2060
1 0.66 0.88 0.75 2060
accuracy 0.71 4120
macro avg 0.74 0.71 0.70 4120
weighted avg 0.74 0.71 0.70 4120
Is Domestic Ultimate Class Distribution:
Is Domestic Ultimate
0 9392
1 7574
Name: count, dtype: int64
Is Domestic Ultimate - Model Accuracy: 70.19%
Classification Report for Is Domestic Ultimate:
precision recall f1-score support
0 0.79 0.55 0.65 1879
1 0.66 0.85 0.74 1878
accuracy 0.70 3757
macro avg 0.72 0.70 0.70 3757
weighted avg 0.72 0.70 0.70 3757
Neural Network (Chosen Model)¶
Our last attempt uses neural networks to classify entries as Domestic Ultimate or Global Ultimate using TensorFlow and Keras. The goal is to predict these binary outcomes based on the provided features using two separate neural network architectures.
We created an initial model with build_neural_network with 3 layers:
Input Layer with 64 neurons (ReLU activation)
Hidden Layer with 32 neurons (ReLU activation)
Output Layer with 1 neuron (Sigmoid activation for binary classification)
After experimenting with model architecture and further finetuning, we found that an additional hidden layer improved prediction power significantly, and thus we improved neural network model performance with build_neural_network_v4 which includes 4 layers:
Input Layer with 64 neurons (ReLU activation)
Two Hidden Layers with 32 and 16 neurons (ReLU activation)
Output Layer with 1 neuron (Sigmoid activation)
The train_and_evaluate function performs the following steps:
Early Stopping: Monitors validation loss and stops training if no improvement is detected for 10 consecutive epochs.
Training: The model is trained for up to 100 epochs with a batch size of 32.
Evaluation: The model's performance is evaluated using test data.
We used accuracy and loss curves, a confusion matrix, and an ROC curve to visualise model performance. We noted that v4 showed improvements across most metrics.
# Load your DataFrame (replace 'your_dataframe' with the actual DataFrame)
# df = your_dataframe
# Define target variables
target_domestic = 'Is Domestic Ultimate'
target_global = 'Is Global Ultimate'
# Split features and targets
X = df.drop(columns=[target_domestic, target_global])
y_domestic = df[target_domestic]
y_global = df[target_global]
# Split into train and test sets (80/20 split)
X_train, X_test, y_train_dom, y_test_dom = train_test_split(X, y_domestic, test_size=0.2, random_state=42)
_, _, y_train_glob, y_test_glob = train_test_split(X, y_global, test_size=0.2, random_state=42)
# Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
def build_neural_network(input_shape):
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu', input_shape=(input_shape,)),
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid') # Binary classification
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
return model
def build_neural_network_v4(input_shape):
from tensorflow.keras import layers
model = tf.keras.Sequential([
layers.Dense(64, activation='relu', input_shape=(input_shape,)),
layers.Dense(32, activation='relu'),
layers.Dense(16, activation='relu'),
layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
return model
# Set random seeds for reproducibility
def set_seed(seed=42):
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed)
def train_and_evaluate(model, X_train, y_train, X_test, y_test, model_name):
from tensorflow.keras.callbacks import EarlyStopping
# Early stopping to avoid overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
# Train the model
history = model.fit(X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2,
callbacks=[early_stopping],
verbose=1)
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"\n{model_name} Test Accuracy: {accuracy:.3f}")
# Plot accuracy and loss
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Val Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title(f'{model_name} Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Val Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title(f'{model_name} Loss')
plt.legend()
plt.tight_layout()
plt.show()
# Generate and display confusion matrix
y_pred = (model.predict(X_test) > 0.5).astype(int)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap=plt.cm.Blues)
plt.title(f'{model_name} Confusion Matrix')
plt.show()
# Compute ROC curve and AUC
y_prob = model.predict(X_test) # This will be a single column for binary classification
if y_prob.shape[1] == 1:
# In case of binary classification, y_prob is a 1D array with probabilities for the positive class
y_prob = y_prob.flatten() # Convert to 1D array
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--') # Diagonal reference line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title(f'ROC Curve for {model_name}')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
# Train and evaluate for Domestic Ultimate
set_seed()
train_and_evaluate(build_neural_network(X_train_scaled.shape[1]),
X_train_scaled, y_train_dom, X_test_scaled, y_test_dom,
"Domestic Ultimate")
# Train and evaluate for Global Ultimate
set_seed()
train_and_evaluate(build_neural_network(X_train_scaled.shape[1]),
X_train_scaled, y_train_glob, X_test_scaled, y_test_glob,
"Global Ultimate")
Epoch 1/100
/usr/local/lib/python3.11/dist-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(activity_regularizer=activity_regularizer, **kwargs)
340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.6151 - loss: 0.6610 - val_accuracy: 0.6913 - val_loss: 0.5942 Epoch 2/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.6944 - loss: 0.5885 - val_accuracy: 0.6983 - val_loss: 0.5703 Epoch 3/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7109 - loss: 0.5615 - val_accuracy: 0.7098 - val_loss: 0.5582 Epoch 4/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7202 - loss: 0.5460 - val_accuracy: 0.7157 - val_loss: 0.5513 Epoch 5/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7237 - loss: 0.5366 - val_accuracy: 0.7223 - val_loss: 0.5459 Epoch 6/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7274 - loss: 0.5292 - val_accuracy: 0.7241 - val_loss: 0.5420 Epoch 7/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7305 - loss: 0.5234 - val_accuracy: 0.7260 - val_loss: 0.5393 Epoch 8/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7318 - loss: 0.5193 - val_accuracy: 0.7289 - val_loss: 0.5375 Epoch 9/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7345 - loss: 0.5155 - val_accuracy: 0.7333 - val_loss: 0.5354 Epoch 10/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7353 - loss: 0.5123 - val_accuracy: 0.7344 - val_loss: 0.5355 Epoch 11/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7359 - loss: 0.5094 - val_accuracy: 0.7322 - val_loss: 0.5346 Epoch 12/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7394 - loss: 0.5063 - val_accuracy: 0.7326 - val_loss: 0.5337 Epoch 13/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7415 - loss: 0.5042 - val_accuracy: 0.7355 - val_loss: 0.5329 Epoch 14/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7419 - loss: 0.5015 - val_accuracy: 0.7319 - val_loss: 0.5325 Epoch 15/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7436 - loss: 0.4994 - val_accuracy: 0.7341 - val_loss: 0.5316 Epoch 16/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7443 - loss: 0.4974 - val_accuracy: 0.7337 - val_loss: 0.5312 Epoch 17/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7460 - loss: 0.4956 - val_accuracy: 0.7344 - val_loss: 0.5310 Epoch 18/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7469 - loss: 0.4937 - val_accuracy: 0.7348 - val_loss: 0.5307 Epoch 19/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7466 - loss: 0.4918 - val_accuracy: 0.7359 - val_loss: 0.5318 Epoch 20/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7493 - loss: 0.4901 - val_accuracy: 0.7341 - val_loss: 0.5314 Epoch 21/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7492 - loss: 0.4882 - val_accuracy: 0.7333 - val_loss: 0.5311 Epoch 22/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7516 - loss: 0.4870 - val_accuracy: 0.7337 - val_loss: 0.5312 Epoch 23/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7516 - loss: 0.4854 - val_accuracy: 0.7330 - val_loss: 0.5309 Epoch 24/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7527 - loss: 0.4839 - val_accuracy: 0.7341 - val_loss: 0.5306 Epoch 25/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7531 - loss: 0.4824 - val_accuracy: 0.7341 - val_loss: 0.5300 Epoch 26/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7543 - loss: 0.4812 - val_accuracy: 0.7333 - val_loss: 0.5309 Epoch 27/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7542 - loss: 0.4793 - val_accuracy: 0.7348 - val_loss: 0.5303 Epoch 28/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7553 - loss: 0.4778 - val_accuracy: 0.7348 - val_loss: 0.5299 Epoch 29/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7570 - loss: 0.4765 - val_accuracy: 0.7352 - val_loss: 0.5304 Epoch 30/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7559 - loss: 0.4752 - val_accuracy: 0.7341 - val_loss: 0.5306 Epoch 31/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7595 - loss: 0.4744 - val_accuracy: 0.7355 - val_loss: 0.5300 Epoch 32/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7600 - loss: 0.4720 - val_accuracy: 0.7337 - val_loss: 0.5321 Epoch 33/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7606 - loss: 0.4712 - val_accuracy: 0.7370 - val_loss: 0.5301 Epoch 34/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7623 - loss: 0.4693 - val_accuracy: 0.7344 - val_loss: 0.5308 Epoch 35/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7620 - loss: 0.4684 - val_accuracy: 0.7344 - val_loss: 0.5313 Epoch 36/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7631 - loss: 0.4671 - val_accuracy: 0.7348 - val_loss: 0.5304 Epoch 37/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7644 - loss: 0.4655 - val_accuracy: 0.7366 - val_loss: 0.5318 Epoch 38/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7655 - loss: 0.4651 - val_accuracy: 0.7392 - val_loss: 0.5321 107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7336 - loss: 0.5587 Domestic Ultimate Test Accuracy: 0.737
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Epoch 1/100
/usr/local/lib/python3.11/dist-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(activity_regularizer=activity_regularizer, **kwargs)
340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.6412 - loss: 0.6363 - val_accuracy: 0.7072 - val_loss: 0.5776 Epoch 2/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7059 - loss: 0.5712 - val_accuracy: 0.7109 - val_loss: 0.5555 Epoch 3/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7212 - loss: 0.5450 - val_accuracy: 0.7219 - val_loss: 0.5422 Epoch 4/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7295 - loss: 0.5295 - val_accuracy: 0.7274 - val_loss: 0.5345 Epoch 5/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7370 - loss: 0.5187 - val_accuracy: 0.7370 - val_loss: 0.5278 Epoch 6/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7417 - loss: 0.5099 - val_accuracy: 0.7433 - val_loss: 0.5235 Epoch 7/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7448 - loss: 0.5028 - val_accuracy: 0.7400 - val_loss: 0.5213 Epoch 8/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7457 - loss: 0.4973 - val_accuracy: 0.7444 - val_loss: 0.5190 Epoch 9/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7476 - loss: 0.4929 - val_accuracy: 0.7396 - val_loss: 0.5161 Epoch 10/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7505 - loss: 0.4888 - val_accuracy: 0.7396 - val_loss: 0.5156 Epoch 11/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7544 - loss: 0.4862 - val_accuracy: 0.7470 - val_loss: 0.5138 Epoch 12/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7578 - loss: 0.4835 - val_accuracy: 0.7440 - val_loss: 0.5132 Epoch 13/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7574 - loss: 0.4813 - val_accuracy: 0.7440 - val_loss: 0.5120 Epoch 14/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7610 - loss: 0.4792 - val_accuracy: 0.7470 - val_loss: 0.5114 Epoch 15/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7620 - loss: 0.4767 - val_accuracy: 0.7459 - val_loss: 0.5103 Epoch 16/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7615 - loss: 0.4751 - val_accuracy: 0.7466 - val_loss: 0.5103 Epoch 17/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7633 - loss: 0.4732 - val_accuracy: 0.7451 - val_loss: 0.5081 Epoch 18/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7633 - loss: 0.4715 - val_accuracy: 0.7481 - val_loss: 0.5081 Epoch 19/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7652 - loss: 0.4702 - val_accuracy: 0.7484 - val_loss: 0.5076 Epoch 20/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7652 - loss: 0.4684 - val_accuracy: 0.7492 - val_loss: 0.5082 Epoch 21/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7660 - loss: 0.4676 - val_accuracy: 0.7495 - val_loss: 0.5079 Epoch 22/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7669 - loss: 0.4658 - val_accuracy: 0.7462 - val_loss: 0.5072 Epoch 23/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7671 - loss: 0.4643 - val_accuracy: 0.7481 - val_loss: 0.5072 Epoch 24/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7675 - loss: 0.4626 - val_accuracy: 0.7459 - val_loss: 0.5090 Epoch 25/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7697 - loss: 0.4616 - val_accuracy: 0.7466 - val_loss: 0.5088 Epoch 26/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7689 - loss: 0.4606 - val_accuracy: 0.7451 - val_loss: 0.5101 Epoch 27/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7699 - loss: 0.4591 - val_accuracy: 0.7455 - val_loss: 0.5094 Epoch 28/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7722 - loss: 0.4581 - val_accuracy: 0.7448 - val_loss: 0.5100 Epoch 29/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7717 - loss: 0.4570 - val_accuracy: 0.7448 - val_loss: 0.5101 Epoch 30/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7724 - loss: 0.4558 - val_accuracy: 0.7440 - val_loss: 0.5103 Epoch 31/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.7734 - loss: 0.4551 - val_accuracy: 0.7433 - val_loss: 0.5101 Epoch 32/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7737 - loss: 0.4537 - val_accuracy: 0.7444 - val_loss: 0.5103 107/107 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.7383 - loss: 0.5330 Global Ultimate Test Accuracy: 0.745
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
# Train and evaluate for Domestic Ultimate
set_seed()
train_and_evaluate(build_neural_network_v4(X_train_scaled.shape[1]),
X_train_scaled, y_train_dom, X_test_scaled, y_test_dom,
"Domestic Ultimate")
# Train and evaluate for Global Ultimate
set_seed()
train_and_evaluate(build_neural_network_v4(X_train_scaled.shape[1]),
X_train_scaled, y_train_glob, X_test_scaled, y_test_glob,
"Global Ultimate")
Epoch 1/100
/usr/local/lib/python3.11/dist-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(activity_regularizer=activity_regularizer, **kwargs)
340/340 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.6174 - loss: 0.6614 - val_accuracy: 0.6939 - val_loss: 0.5911 Epoch 2/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.6992 - loss: 0.5853 - val_accuracy: 0.7028 - val_loss: 0.5628 Epoch 3/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7106 - loss: 0.5528 - val_accuracy: 0.7098 - val_loss: 0.5522 Epoch 4/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7189 - loss: 0.5376 - val_accuracy: 0.7186 - val_loss: 0.5463 Epoch 5/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7261 - loss: 0.5286 - val_accuracy: 0.7238 - val_loss: 0.5424 Epoch 6/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7245 - loss: 0.5230 - val_accuracy: 0.7230 - val_loss: 0.5415 Epoch 7/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7304 - loss: 0.5180 - val_accuracy: 0.7260 - val_loss: 0.5393 Epoch 8/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7319 - loss: 0.5144 - val_accuracy: 0.7256 - val_loss: 0.5390 Epoch 9/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.7306 - loss: 0.5113 - val_accuracy: 0.7300 - val_loss: 0.5376 Epoch 10/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.7355 - loss: 0.5071 - val_accuracy: 0.7293 - val_loss: 0.5359 Epoch 11/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7400 - loss: 0.5038 - val_accuracy: 0.7293 - val_loss: 0.5354 Epoch 12/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7406 - loss: 0.5011 - val_accuracy: 0.7300 - val_loss: 0.5368 Epoch 13/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7416 - loss: 0.4985 - val_accuracy: 0.7319 - val_loss: 0.5349 Epoch 14/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7443 - loss: 0.4959 - val_accuracy: 0.7293 - val_loss: 0.5348 Epoch 15/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7455 - loss: 0.4948 - val_accuracy: 0.7297 - val_loss: 0.5357 Epoch 16/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7469 - loss: 0.4922 - val_accuracy: 0.7297 - val_loss: 0.5340 Epoch 17/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7486 - loss: 0.4900 - val_accuracy: 0.7319 - val_loss: 0.5343 Epoch 18/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7490 - loss: 0.4885 - val_accuracy: 0.7311 - val_loss: 0.5325 Epoch 19/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7536 - loss: 0.4865 - val_accuracy: 0.7315 - val_loss: 0.5335 Epoch 20/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7516 - loss: 0.4847 - val_accuracy: 0.7352 - val_loss: 0.5324 Epoch 21/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7552 - loss: 0.4825 - val_accuracy: 0.7311 - val_loss: 0.5321 Epoch 22/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7551 - loss: 0.4806 - val_accuracy: 0.7333 - val_loss: 0.5326 Epoch 23/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7586 - loss: 0.4788 - val_accuracy: 0.7333 - val_loss: 0.5328 Epoch 24/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7588 - loss: 0.4778 - val_accuracy: 0.7330 - val_loss: 0.5340 Epoch 25/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7611 - loss: 0.4765 - val_accuracy: 0.7337 - val_loss: 0.5335 Epoch 26/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7618 - loss: 0.4746 - val_accuracy: 0.7333 - val_loss: 0.5344 Epoch 27/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7593 - loss: 0.4733 - val_accuracy: 0.7355 - val_loss: 0.5323 Epoch 28/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7634 - loss: 0.4722 - val_accuracy: 0.7348 - val_loss: 0.5349 Epoch 29/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7640 - loss: 0.4710 - val_accuracy: 0.7359 - val_loss: 0.5357 Epoch 30/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7643 - loss: 0.4694 - val_accuracy: 0.7355 - val_loss: 0.5328 Epoch 31/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7670 - loss: 0.4671 - val_accuracy: 0.7344 - val_loss: 0.5342 107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7369 - loss: 0.5604 Domestic Ultimate Test Accuracy: 0.740
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
Epoch 1/100
/usr/local/lib/python3.11/dist-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead. super().__init__(activity_regularizer=activity_regularizer, **kwargs)
340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.6471 - loss: 0.6391 - val_accuracy: 0.6998 - val_loss: 0.5791 Epoch 2/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7088 - loss: 0.5723 - val_accuracy: 0.7230 - val_loss: 0.5485 Epoch 3/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7254 - loss: 0.5409 - val_accuracy: 0.7297 - val_loss: 0.5340 Epoch 4/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7335 - loss: 0.5229 - val_accuracy: 0.7392 - val_loss: 0.5258 Epoch 5/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7380 - loss: 0.5115 - val_accuracy: 0.7370 - val_loss: 0.5226 Epoch 6/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7406 - loss: 0.5025 - val_accuracy: 0.7411 - val_loss: 0.5208 Epoch 7/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7462 - loss: 0.4966 - val_accuracy: 0.7433 - val_loss: 0.5172 Epoch 8/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7468 - loss: 0.4912 - val_accuracy: 0.7429 - val_loss: 0.5160 Epoch 9/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 5ms/step - accuracy: 0.7506 - loss: 0.4862 - val_accuracy: 0.7425 - val_loss: 0.5150 Epoch 10/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.7527 - loss: 0.4833 - val_accuracy: 0.7418 - val_loss: 0.5139 Epoch 11/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7574 - loss: 0.4799 - val_accuracy: 0.7396 - val_loss: 0.5137 Epoch 12/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.7568 - loss: 0.4768 - val_accuracy: 0.7425 - val_loss: 0.5140 Epoch 13/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 7ms/step - accuracy: 0.7564 - loss: 0.4749 - val_accuracy: 0.7436 - val_loss: 0.5143 Epoch 14/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.7592 - loss: 0.4715 - val_accuracy: 0.7455 - val_loss: 0.5132 Epoch 15/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 8ms/step - accuracy: 0.7611 - loss: 0.4695 - val_accuracy: 0.7444 - val_loss: 0.5133 Epoch 16/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.7627 - loss: 0.4682 - val_accuracy: 0.7462 - val_loss: 0.5129 Epoch 17/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7629 - loss: 0.4656 - val_accuracy: 0.7459 - val_loss: 0.5126 Epoch 18/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7642 - loss: 0.4629 - val_accuracy: 0.7440 - val_loss: 0.5122 Epoch 19/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7665 - loss: 0.4610 - val_accuracy: 0.7473 - val_loss: 0.5122 Epoch 20/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7717 - loss: 0.4586 - val_accuracy: 0.7444 - val_loss: 0.5131 Epoch 21/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7697 - loss: 0.4577 - val_accuracy: 0.7455 - val_loss: 0.5113 Epoch 22/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7717 - loss: 0.4557 - val_accuracy: 0.7462 - val_loss: 0.5116 Epoch 23/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7732 - loss: 0.4544 - val_accuracy: 0.7459 - val_loss: 0.5096 Epoch 24/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7740 - loss: 0.4520 - val_accuracy: 0.7473 - val_loss: 0.5148 Epoch 25/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7753 - loss: 0.4495 - val_accuracy: 0.7473 - val_loss: 0.5103 Epoch 26/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7747 - loss: 0.4488 - val_accuracy: 0.7462 - val_loss: 0.5119 Epoch 27/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7750 - loss: 0.4471 - val_accuracy: 0.7473 - val_loss: 0.5113 Epoch 28/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 4ms/step - accuracy: 0.7755 - loss: 0.4456 - val_accuracy: 0.7470 - val_loss: 0.5138 Epoch 29/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7784 - loss: 0.4440 - val_accuracy: 0.7488 - val_loss: 0.5128 Epoch 30/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7798 - loss: 0.4431 - val_accuracy: 0.7495 - val_loss: 0.5184 Epoch 31/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7793 - loss: 0.4426 - val_accuracy: 0.7466 - val_loss: 0.5158 Epoch 32/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 2s 5ms/step - accuracy: 0.7771 - loss: 0.4413 - val_accuracy: 0.7429 - val_loss: 0.5131 Epoch 33/100 340/340 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.7808 - loss: 0.4408 - val_accuracy: 0.7462 - val_loss: 0.5144 107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7478 - loss: 0.5287 Global Ultimate Test Accuracy: 0.751
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
107/107 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Model Evaluation¶
The following table shows a summary of performance of the respective model's predictions on test data.
The final Build Neural network model achieved an accuracy of 0.740 for the domestic domain and 0.751 for the global ultimate resolution, representing a notable improvement over the initial model's accuracies of 0.737 and 0.745, respectively. Compared to other models, the neural network outperformed both the random forest, which had accuracies of 0.702 and 0.712, and the decision tree, which achieved accuracies of 0.686 and 0.721. Logistic regression showed the lowest performance, with accuracies of 0.637 and 0.663 for the domestic and global ultimate resolutions, respectively. These results indicate that the neural network model provides the best predictive performance across the evaluated domains.
Conclusion¶
This project successfully developed a robust machine learning model to predict the "Is Domestic Ultimate" and "Is Global Ultimate" classifications for companies based on their operational, financial, and structural characteristics. By leveraging advanced data preprocessing, feature engineering, and model optimization techniques, we achieved accurate and reliable predictions that provide valuable insights into corporate ownership structures.
Our analysis highlighted the significance of key factors such as employee count, revenue, and industry classification in determining a company's hierarchical position. The use of ensemble learning methods, particularly XGBoost, proved to be highly effective in handling the complexity of the dataset while maintaining strong predictive performance. Additionally, addressing class imbalance and optimizing hyperparameters enhanced the model’s generalization capabilities.
These findings have practical implications for investors, regulators, and business strategists seeking to assess corporate autonomy and influence. Future work can explore deep learning models and additional data sources to refine the predictions further, ensuring greater accuracy and adaptability to evolving corporate structures. This study underscores the power of machine learning in transforming corporate analysis, offering a scalable and data-driven approach to understanding global business networks.